Skip to content

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19

Open
tomjen12 wants to merge 15 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized
Open

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
tomjen12 wants to merge 15 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized

Conversation

@tomjen12

@tomjen12 tomjen12 commented Jun 18, 2026

Copy link
Copy Markdown

Summary

Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.

The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.

Results

On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:

  • Baseline: 76% min, 84% geomean
  • This branch: 86% min, 94% geomean

Companion PyFR PR: PyFR/PyFR#567

Test plan

  • Ran PyFR HIP GiMMiK matmul correctness checks against NumPy reference results.
  • Ran MI300X double-precision p2-p5 benchmark cases.

@tomjen12 tomjen12 marked this pull request as draft June 18, 2026 05:47
@tomjen12 tomjen12 marked this pull request as ready for review June 18, 2026 05:48
@tomjen12

Copy link
Copy Markdown
Author

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

if A[j, kx] != 0:
nzixs.append((l_idx, kx))

has_dotp = any(A[j, kx] != 0 for kx in range(k))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].


% if width == 2:
static inline __device__ ${dtype}
gimmik_vmul(${dtype[:-1]} a, ${dtype} b)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Overall looks good. Few quick things:

  • Confirm nothing breaks on the consumer SKU's. Performance is not critical here but some of our users use RDNA GPUs for local development and wave32 vs 64 has tripped us up in the past.
  • As we're now emitting more kernels this can have a negative impact on the start-up time of PyFR when they are JIT'ed for the first time. If you have reasonable heuristics it might be adopting the approach of PTX Backend #18 which has JSON files with tuning info (and am working on that PR to get the code moved into the base class so all generators can share it). PyFR passes in the wave_size and GCN arch name (queried from HIP) to GiMMiK which can be used for dispatch. This is especially useful if a certain trick always yields a benefit.

@FreddieWitherden FreddieWitherden self-assigned this Jun 18, 2026
@FreddieWitherden

Copy link
Copy Markdown
Contributor

Couple of other notes (mostly for me):

  • Confirm compatibility with PyFR v3.1 which is the latest released version. This only uses the static n kernel variants so shouldn't have any issues with wide width kernels (since GiMMiK provides the exact block sizes).
  • Confirm no regressions on MI250X which is widely used due to its deployment on supercomputers in the US and Europe. It is fine if performance doesn't improve, but we want to avoid regressions.

@tomjen12

tomjen12 commented Jun 22, 2026

Copy link
Copy Markdown
Author

Thanks for the review and advice.
Updates made based on your comments:

  • Simplified the NumPy row checks using array operations, following the suggestion around A[j].any() and avoiding explicit Python loops where possible.
  • Moved the vector helper code into a shared Mako include so the width/vector logic is kept in one place and reused by the HIP kernels.
  • Confirmed compatibility with PyFR v3.1. The optimized HIP variants pass the correctness checks with PyFR v3.1.
  • Confirmed MI250X (gfx90a) correctness as well. The 22 variants pass accuracy checks on the tested gfx90a cases, and performance also looks improved overall.
  • For dispatch, I am currently gating the tuned HIP variants using the HIP-reported GCN arch and wave size:
    base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
    if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
        return
        

This enables the tuned variants for validated gfx90a and gfx942 wave64 targets while avoiding other architectures for now.

  • RDNA / consumer GPU testing is still pending. I am currently looking for access to a suitable machine to verify wave32 behavior before enabling anything there.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Thanks!

Going through the PR there appear to be three main improvements.

  • Preloading Current GiMMiK kernels attempt to load parts of $C$ on demand, whereas the PR also considers doing the loads all upfront. If this is found to always be beneficial then we probably can eliminate the on-demand paths. (The original thoughts around incremental loading were to get some extra ILP, but I don't think I ever considered preloading.)
  • Non-temporal loads/stores These are used in some guise on NVIDIA already. Again, if we find they are always beneficial we should drop support for the non-temporal path. If they're neutral we can probably go either way. If there are cases where they help and others where they regress we should see if there is a simple heuristic to guide us. For example, on x86 CPUs we use non-temporal stores when the matrix is above a certain size (32 MiB if I recall). Should we keep both it is worth unifying the kernels. This can be done through either inline device functions, classic C preprocessor macros, or passing Python lamdba's to the Mako templates and evaluating them to yield the correct syntax. All of these approaches are fine.
  • Wider kernels This is something we used to have on HIP but it was removed in 07220ea Is there any reason this path can not be brought back as-is? The benefit is that it shares a kernel template with the current code.

All in, if I've understood everything correctly I do not think we need any new templates, just extra parameterization on the templates we have. This tends to make things more manageable and easier to understand so it would be preferable. Again, it is possible I've missed something about the new kernels which really does require a different overall structure, so let me know there.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Have just run the PR through my benchmark suite (p = 2 to 5, double precision, ~4 GiB of data so we always asymptote) and find the overall performance to be about the same (geomean ratio PR/base: 1.003x). Did your benchmarking also include the rocBLAS kernels? It could be that the wins are coming from kernels where rocBLAS does better than either the new or old GiMMiK kernels.

@tomjen12

tomjen12 commented Jun 23, 2026

Copy link
Copy Markdown
Author

Thanks for checking this.

Could you share a bit more detail about your benchmark setup?

  • What platform/GPU are you using for the test?
  • What ROCm version or Docker image are you using?
  • Could you share the benchmark script or command line you used? I can run the same setup on my side for comparison.
    Also, the new HIP variants are currently gated on validated wave64 targets:
base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
    return

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

@tomjen12

Copy link
Copy Markdown
Author

Regarding the template structure, I will also take another look and see how much can be merged back into the existing HIP templates.

@tomjen12

Copy link
Copy Markdown
Author

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

For the non-temporal C load/store path, I tested the four basic HIP GiMMiK variants over the 53 PyFR matrix cases. Across 212 per-variant comparisons, the non-temporal path wins 190 cases and gives a +10.51% geomean bandwidth improvement. There are a few per-variant regressions, but for the affected cases other variants remain non-regressing, so the autotuner should still be able to avoid a loss in the final selected kernel. Based on this, I think using the non-temporal C path by default is justified.

image

For preload-C, the result is different. It is not uniformly beneficial: preload-C wins 83/212 comparisons, while the on-demand path wins 129/212, with an overall geomean bandwidth speedup of 0.94x. Some cases do improve substantially, so I think preload-C is still useful to keep as additional autotune candidates, but it should not replace the on-demand variants outright.

image

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

It was a caching issue on my end. I can now reproduce your results on an MI300X w/ROCm 7.14.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

Thank you for investigating. Assuming non-temporal loads/stores are widely available (so compiles for older CDNA and RDNA) I think the best course of action is to enable them everywhere all the time for HIP. This simplifies the code and puts us at parity with CUDA where we also ask for quasi-non temporal stores.

In terms of preloading I agree with making this an option in the auto-tuner. Extra kernel variations are not as big of a deal for HIP since the kernels are generic on n whereas on CUDA the sizes are baked into the code so we pay more of a hit.

Only other thing to check is that we yield kernels in a sensible order. To keep start-up times down PyFR can be instructed to only consider the first m kernels from GiMMiK. Hence, it is worth seeing if we can yield them in a sensible-ish order (or better still, if there are combos which never seem to win out, just prune them on our end). No need to strive for perfection here, it is more a "nice to have" rather than something essential.

(The long term goal is to make things adaptive; PyFR when it compiles a kernel passes information about the register use and spillage back to GiMMiK. At the moment this is discarded, but could be useful in the future for things like "don't bother with a width = 2 kernel if width = 1 spills".)

Finally, while it is not a priority at the moment, can you check there are no major regressions single precision.

@tomjen12

Copy link
Copy Markdown
Author

Thanks.

I've removed the nt_c option for HIP. Non-temporal loads/stores for C are now always enabled, which simplifies the HIP path.

I also checked single precision and do not see any major regressions. For the 53 GiMMiK-accessible cases tested, the geometric mean is about a 4% improvement, with the worst case at 98% of the previous baseline, which is likely within run-to-run noise.

For the kernel yield order, I agree this is worth looking at. My colleague is planning to push a follow-up commit with the non-temporal B optimization. Once that lands, I can use the updated tuning data to pick a better order.

On memory-bound operators the B matrix is read once from HBM and reused
only within a work-group via LDS -- it is never re-read across blocks.
A normal global load still allocates B's line in L2, which is pure
overhead: the line is never reused, it evicts genuinely-reusable data,
and it adds cache-allocate/eviction traffic. This is the read-side mirror
of the non-temporal C store we already use.

NTB loads B with a non-temporal load (load_b -> __builtin_nontemporal
path) so B bypasses L2 instead of polluting it. It moves the same number
of bytes but keeps the cache clean, raising effective bandwidth.

Implemented as a flag on the existing templates rather than new files:
- base.mako: add a load_b() wrapper (non-temporal B load).
- bstream-msplit{,-preload-c}.mako: gate the B read behind an `ntload`
  flag (context.get('ntload', False)); renders byte-identically to the
  plain kernel when the flag is absent.
- hip.py: emit `*-ntb` variants by passing ntload=True inside the
  existing width loop, so NTB combines with width (w1/w2) automatically.

Backward-compatible (plain variants unchanged) and CDNA-gated like the
other tuned variants. On MI300X (gfx942) NTB passes the accuracy check
(~1e-15) and wins the autotune in the majority of memory-bound cases
(~+4.5% bandwidth on those), being chosen over the plain bstream-msplit.
@@ -0,0 +1,104 @@
<%inherit file='base'/>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this into the msplit kernel which preload as an option using % if/else as appropriate to switch between the two?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I merged the preload-C path into the existing msplit template behind a preload option, and removed the separate preload-C template.

Comment thread gimmik/kernels/hip/base.mako Outdated
{ return 0; }
% endif

% if width == 1:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define overloads like the CUDA backend does:

https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/base.mako#L13

Just keeps the code a little cleaner and more consistent.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I updated the HIP base template to use vector operator overloads in the same style as CUDA. I omitted operator+= since HIP already provides it for vector types and defining it here caused an overload ambiguity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Is the += a runtime issue that should be reported upstream? Seems odd to provide an overloaded for += but nothing else. (If I recall, earlier versions of ROCm provided a full suite of overloads, but that caused issues with CUDA code compatibility where none are provided, so newer versions switch to not providing overloads. Did += sneak through?)

tomjen12 added 4 commits June 24, 2026 23:00
Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.
@tomjen12

Copy link
Copy Markdown
Author

I've pushed another round of cleanup based on the review feedback:

  • Non-temporal B loads are now the default. On MI300X, across the 53-case benchmark set, this moves the geomean from ~94% to ~97% of the 4.4 TB/s reference bandwidth.
  • The HIP templates now use natural vector operators, while avoiding custom operator+= overloads due to ambiguity with the HIP runtime definitions.
  • The preload-C variants have been merged into the base templates via a preload option.
  • The tuned HIP search space has been reduced from 28 variants to 12, and the remaining variants are ordered so the commonly winning candidates are tried earlier. With this smaller set, the MI300X benchmark geomean remains at ~97% of the 4.4 TB/s reference bandwidth.

@tomjen12

Copy link
Copy Markdown
Author

Added back bstream-msplit/m4-b24-x64 and cstream-ksplit/k2-c24-x64 because MI355 benchmarks show these baseline variants are still competitive and sometimes faster than the newer tuned variants. Keeping them in the candidate list gives autotune the option to select them when appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants